- {
- Robert Rothenburg
-
- Ok, since a few people have requested info about compression routines I
- thought some actual descriptions of the algorithm(s) involved would be a
- good idea. Rather than dig out someone else's text file or mail (which
- might be copyrighted or arcane) I'll try my hand at explaining a few
- methods.
-
- It seems better that programmers have at least a rudimentary understanding
- of the algorithms if they plan on using (or not using :) them.
-
- DISCLAIMER: Please pardon any inaccuracies: what I know is based on
- other people's mail, text files, or what I've gleaned from spending a
- few hours at the library. (Advice: your local college research
- library is a wonderful resource for algorithms and source code. Old
- magazines as well as academic papers like the IEEE Transactions are
- worth examining.)
-
-
- In this insanely long post:
-
- I. "Garbage Collection"
- II. "Keywords"
- III. Run-Length Encoding (RLE)
- IV. Static and Dynamic Huffman Codes
- V. Lempel-Ziv (LZ)
- VI. Lempel-Ziv-Welch (LZW) et al.
-
-
- I. "Garbage Collection"
-
- The simplest method of compression is weeding out unnecessary data,
- especially from text files: spaces at the end of a line, or extra
- carriage returns at the end of a file. (Especially with MS-DOS text
- files, which use a Carriage Return *and* Line Feed--many editors and
- file-browsers do not need both. Eliminating one of them can cut a large
- text file's size by a few percent!)
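-
- Here's a tiny C sketch of this sort of clean-up (my own toy filter,
- not any particular utility): it strips trailing spaces from each line
- and keeps only the Line Feeds, on the assumption that whatever reads
- the file doesn't need the Carriage Returns.
-
-     #include <stdio.h>
-
-     /* Copy stdin to stdout, dropping CRs and trailing spaces. */
-     int main(void)
-     {
-         int c, spaces = 0;
-
-         while ((c = getchar()) != EOF) {
-             if (c == ' ')
-                 spaces++;              /* hold runs of spaces...    */
-             else if (c == '\r')
-                 ;                      /* drop CRs entirely         */
-             else if (c == '\n') {
-                 spaces = 0;            /* ...a trailing run is lost */
-                 putchar('\n');
-             } else {
-                 while (spaces > 0) {   /* ...an interior run stays  */
-                     putchar(' ');
-                     spaces--;
-                 }
-                 putchar(c);
-             }
-         }
-         return 0;
-     }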
-
- I've also seen some utilities that "clean up" the headers of EXE
- files (such as UNP) by eliminating `unnecessary' info. Other
- utilities will clean up graphics files so that they'll compress
- better.
-
- In fact, removing excess "garbage" from any file will probably
- improve its compressibility.
-
- II. "Keywords"
-
- Another way to compress text files is to use a "keyword" for each word.
-
- That is, we can assume most text files will have fewer than 65,536
- different words (16 bits = two bytes), and thus we can assign a
- different two-byte value to each word: since most words are longer
- than two letters (more than two bytes), we will save space in the
- file. However, we'll need some form of look-up table for each
- keyword--one way around this might be to use a reference to a
- standard dictionary file that is included with an operating system
- or word processor.
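-
- A toy sketch of the idea in C (the five-word list is made up, and a
- real scheme would need an escape code for words that aren't in the
- dictionary):
-
-     #include <string.h>
-
-     /* Both coder and decoder share this list, so a word can travel
-        as a 16-bit index instead of as its letters.                */
-     static const char *dict[] = { "THE", "THAT", "WHICH", "IS", "NOT" };
-     #define NWORDS (sizeof dict / sizeof dict[0])
-
-     /* Return the 16-bit code for a word, or -1 if it's not listed. */
-     int word_code(const char *word)
-     {
-         unsigned i;
-
-         for (i = 0; i < NWORDS; i++)
-             if (strcmp(dict[i], word) == 0)
-                 return (int)i;
-         return -1;      /* caller stores the word as literal text   */
-     }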
-
- (This has other advantages as well. Many BASIC interpreters will store
- commands like PRINT or INPUT as a character code rather than the entire
- name in ASCII text. Not only does this save memory, it also improves
- the run-speed of the program.)
-
- This method can be adapted to other (non-text) files, of course.
-
- III. Run-Length Encoding (RLE)
-
- With RLE, repeated characters (such as leading spaces in text, or large
- areas of one color in graphics) are expressed with a code noting which
- byte is repeated and how many times it's repeated.
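-
- A minimal run-length encoder might look like this in C. (The output
- format--a count byte followed by the data byte--is just one common
- choice; real RLE schemes differ in how they mark runs.)
-
-     #include <stdio.h>
-
-     /* Emit (count, byte) pairs: 'A' repeated 12 times becomes the
-        two bytes 12 and 'A'.  Runs longer than 255 are split up.   */
-     void rle_encode(FILE *in, FILE *out)
-     {
-         int c, prev = EOF, count = 0;
-
-         while ((c = getc(in)) != EOF) {
-             if (c == prev && count < 255) {
-                 count++;
-             } else {
-                 if (prev != EOF) {
-                     putc(count, out);
-                     putc(prev, out);
-                 }
-                 prev = c;
-                 count = 1;
-             }
-         }
-         if (prev != EOF) {          /* flush the final run          */
-             putc(count, out);
-             putc(prev, out);
-         }
-     }
-
- Note that this naive format *doubles* data with no runs in it;
- practical schemes only switch to (count, byte) pairs when a run is
- long enough to be worth it.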
-
- IV. Static and Dynamic Huffman Codes
-
- The logic behind this method is that certain characters occur more often
- than others, and thus the more common characters can be expressed with
- fewer bits. For example, take a text file: 'e' might be the most common
- character, then 'a', then 's', then 'i', etc....
-
- We can express these characters in a bit-stream like so:
-
- 'e' = 1
- 'a' = 01 <--These are Static Huffman Codes.
- 's' = 001
- 'i' = 0001
- etc...
-
- Since these characters normally take up eight bits each, the first
- (most common) seven will save space when expressed this way, provided
- they occur often enough in relation to the other 249 characters.
-
- This is often represented as a (Static) Huffman Tree:
-
- /\ Note that a compression routine
- 'e' = 1 0 will have to first scan the data
- / \ and count which characters are
- 'a' = 1 0 more common and assign the codes
- / \ appropriately.
- etc... 1 0
- / \
- etc...
-
- Notice that if we use the full 256 ASCII characters we'll run into
- a problem with long strings of 0's. We can get around that by using
- RLE (which is what the SQUEEZE compressor uses).
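-
- A sketch of the counting pass in C (my own toy code, not SQUEEZE):
- rank the characters by frequency, so the encoder can then emit, for
- each byte c, rank[c] zeros followed by a single 1.
-
-     #include <stdio.h>
-
-     /* First pass of the static coder: count each byte, then rank
-        them so that rank[c] == 0 for the most common byte.         */
-     void build_ranks(FILE *in, int rank[256])
-     {
-         long count[256] = {0};
-         int order[256];
-         int i, j, c;
-
-         while ((c = getc(in)) != EOF)
-             count[c]++;
-
-         for (i = 0; i < 256; i++)
-             order[i] = i;
-         for (i = 0; i < 256; i++)   /* crude selection sort         */
-             for (j = i + 1; j < 256; j++)
-                 if (count[order[j]] > count[order[i]]) {
-                     int t = order[i];
-                     order[i] = order[j];
-                     order[j] = t;
-                 }
-         for (i = 0; i < 256; i++)
-             rank[order[i]] = i;
-     }
-
- The second pass would rewind the file and write rank[c] zeros and a
- 1 for each byte read. (The rank table itself also has to be stored
- in the output so the decoder can reverse the mapping.)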
-
- Again, since most files use a large range of the character set, and
- the occurrences are much more "even", we can use Dynamic Huffman
- Codes as an alternative. They are like their static cousins, except
- that the string of n zeros (ended by a 1) is followed by n more
- (binary) digits:
-
- 1 = 1 1 character
- 01x = 010, 011 2 chars
- 001xx = 00100, 00101, 00110, 00111 4 chars
- 0001xxx = 0001000, 0001001 ... 0001111 8 chars
- etc...
-
- As you can guess, the Huffman Tree would look something like:
-
- /\ It's a bit too complicated to
- 1 0 express with ASCII <g>
- / \
- 1 0
- / \ /\
- 1 0 etc...
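-
- (These dynamic codes are essentially the Elias gamma codes: the code
- for rank r is the gamma code for r+1.) A minimal emitter in C, where
- put_bit() is an assumed helper that packs bits into output bytes:
-
-     /* Emit the dynamic code for frequency rank r (0 = most common):
-        n zeros, a 1, then the remaining n bits of (r+1).            */
-     void put_dynamic_code(unsigned r, void (*put_bit)(int))
-     {
-         unsigned v = r + 1;
-         int n = 0, i;
-
-         while ((v >> (n + 1)) != 0)     /* n = position of top bit   */
-             n++;
-         for (i = 0; i < n; i++)         /* the run of n zeros        */
-             put_bit(0);
-         for (i = n; i >= 0; i--)        /* v's bits, MSB first: the  */
-             put_bit((v >> i) & 1);      /* top bit is the ending '1' */
-     }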
-
- Huffman Coding is based on how often an item (such as an ASCII
- character) occurs, and not on where it occurs or what it's next to.
- It's not the most efficient method, but it is one of the simplest.
- One only needs to make an initial pass to count the characters and
- then a second pass translating them into the appropriate codes. No
- pattern-matching or search routines are required.
-
- V. Lempel-Ziv (LZ) Encoding.
-
- This method _does_ require some fast pattern-matching/search routines.
- It takes advantage of the fact that in most files (especially text)
- whole strings of characters repeat themselves.
-
- Take this sample line of text:
-
- "THAT WHICH IS, IS. THAT WHICH IS NOT, IS NOT. IS IT? IT IS!"
-
- (Ok, so "Flowers for Algernon" was on TV the other night... :)
- With LZ, we read in from the beginning of the file and keep adding
- groups ("Windows") of characters that have not already occurred.
- So we add the first sentence, then get to the second "IS"...instead
- of repeating it we note that it's a duplicate, with a reference
- to where the original occurred and how long the string is. (That is, we note
- that the "IS" occurred 4 characters previously, and is two characters
- long.)
-
- This method can be tweaked by using a "Sliding Window" approach where
- the largest matches are found...we can compress the second "IS" with
- the first, but we don't gain much. However the two "IS NOT"s, when
- matched against each other, would compress better.
-
- Our compressed file will have two types of data: the raw characters and
- the LZ-references (containing the pointer and length) to the raw
- characters, or even to other LZ-references.
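-
- The heart of an LZ coder is the match search. A brute-force sketch
- in C (real coders use hashing or trees to make this fast):
-
-     /* Find the longest earlier match for the string starting at
-        buf[pos]; returns how far back it was and how long it is.   */
-     void find_match(const unsigned char *buf, int pos, int end,
-                     int *best_dist, int *best_len)
-     {
-         int start, len;
-
-         *best_dist = 0;
-         *best_len = 0;
-         for (start = 0; start < pos; start++) {
-             len = 0;
-             while (pos + len < end && buf[start + len] == buf[pos + len])
-                 len++;
-             if (len > *best_len) {
-                 *best_len = len;
-                 *best_dist = pos - start;
-             }
-         }
-     }
-
- The encoder would emit an LZ-reference (distance, length) whenever
- the best match is long enough to pay for itself, and a raw character
- otherwise.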
-
- One way to tweak this further is to compress the LZ-References using
- Huffmann codes. (I think this is what's done with the ZIP algorithm.)
-
- VI. Lempel-Ziv-Welch (LZW) -- Not to be confused with LZ compression.
-
- This is the method used by Unix COMPRESS, as well as the GIF-graphics
- file format.
-
- Like LZ, LZW compression can compress a stream of data in one pass
- relatively quickly (assuming you've got the fast search routines!).
- However, LZW uses a hash table (essentially it's an array of strings,
- also known as a string table), and a "match" is expressed as its index
- in the hash table rather than a reference to where and how long the
- match is.
-
- The input stream is read in, and each string is extended for as long
- as it is already in the hash table; once the extended string is no
- longer found, it is added to the hash table and the code for its
- known part is sent out, in such a way that a decoder can still
- re-assemble the file.
-
- Needless to say, this method is a memory hog. Most compressors that
- use it therefore limit the number of entries in the hash table
- (12- or 16-bit codes, i.e. 4096 or 65536 entries respectively).
-
- Here's an outline of the LZW Compression Algorithm:
-
- 0. a. Initialize the hash table. (That means for each possible
- character there is one "root" entry in the table. Some
- flavors of LZW may actually have a "null" non-character
- at the base of the table.)
- b. Initialize the "current" string (S) to null.
-
- 1. Read a character (K) from the input stream.
- (If there are none left, output the code for S and we're done.)
-
- 2. Is the string S+K in the string-table?
-
- 3. If yes, then a. S := S+K
- b. Go to Step 1.
-
- If no, then a. add S+K to the string table.
- b. output the code for S
- c. S := K
- d. Go to Step 1.
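-
- Here's roughly how that outline might look in C. This is a toy
- sketch: the table uses the prefix+character trick from point 2 near
- the end of this post, the search is a slow linear scan, and codes
- are written as decimal text just to keep it short.
-
-     #include <stdio.h>
-
-     #define MAX_CODES 4096          /* a 12-bit table               */
-
-     /* Entry i is the string for entry prefix[i] plus the single
-        character suffix[i].  Codes 0..255 are the root entries.    */
-     static int prefix[MAX_CODES];
-     static int suffix[MAX_CODES];
-     static int ncodes;
-
-     static int find(int s, int k)   /* Step 2: is S+K in the table? */
-     {
-         int i;
-
-         for (i = 256; i < ncodes; i++)
-             if (prefix[i] == s && suffix[i] == k)
-                 return i;
-         return -1;
-     }
-
-     void lzw_compress(FILE *in, FILE *out)
-     {
-         int s, k, i;
-
-         ncodes = 256;               /* Step 0a: one root per byte   */
-         s = getc(in);               /* prime S with the first char  */
-         while ((k = getc(in)) != EOF) {
-             i = find(s, k);
-             if (i >= 0) {           /* Step 3, yes: S := S+K        */
-                 s = i;
-             } else {                /* Step 3, no:                  */
-                 if (ncodes < MAX_CODES) {
-                     prefix[ncodes] = s;     /* add S+K to the table */
-                     suffix[ncodes] = k;
-                     ncodes++;
-                 }
-                 fprintf(out, "%d\n", s);    /* output the code for S */
-                 s = k;              /* S := K                       */
-             }
-         }
-         if (s != EOF)
-             fprintf(out, "%d\n", s);        /* flush the final S    */
-     }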
-
- Decompression is very similar.
-
- 0. a. Initialize the string table (the same as with compression).
- b. Read the first code from the compressed file into C
- c. Then output the equivalent string for code C
- d. Let code O := code C (The code, not the string!)
-
- 1. Read the next code from the input stream into C
-
- 2. Does code C exist in the string/hash table?
-
- 3. If yes, then a. output the string for code C
- b. S := equivalent string for code O
- c. K := first character of the string for code C
- d. add S+K to the string table
-
- If no, then a. S := equivalent string for code O
- b. K := first character of S
- c. output S+K
- d. add S+K to the table
-
- 4. code O := code C
-
- 5. Go to Step 1.
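-
- And a matching C sketch of the decompressor. It assumes it lives in
- the same file as the compression sketch so it can reuse prefix[],
- suffix[], ncodes and MAX_CODES; note how the "no" branch handles a
- code that isn't in the table yet.
-
-     /* Write out the string for a code by walking the prefix links;
-        the recursion puts the characters back in the right order.  */
-     static void string_of(FILE *out, int code)
-     {
-         if (code >= 256)
-             string_of(out, prefix[code]);
-         putc(code >= 256 ? suffix[code] : code, out);
-     }
-
-     static int first_char(int code)  /* K: first char of a string  */
-     {
-         while (code >= 256)
-             code = prefix[code];
-         return code;
-     }
-
-     void lzw_decompress(FILE *in, FILE *out)
-     {
-         int o, c;
-
-         ncodes = 256;                        /* Step 0a             */
-         if (fscanf(in, "%d", &o) != 1)       /* Step 0b             */
-             return;
-         string_of(out, o);                   /* Step 0c             */
-         while (fscanf(in, "%d", &c) == 1) {  /* Step 1              */
-             if (c < ncodes) {                /* Step 3, yes         */
-                 string_of(out, c);           /* output string for C */
-                 if (ncodes < MAX_CODES) {
-                     prefix[ncodes] = o;              /* S := str(O) */
-                     suffix[ncodes] = first_char(c);  /* K           */
-                     ncodes++;                /* add S+K             */
-                 }
-             } else {                         /* Step 3, no          */
-                 prefix[ncodes] = o;                  /* S := str(O) */
-                 suffix[ncodes] = first_char(o);      /* K := S[0]   */
-                 ncodes++;
-                 string_of(out, c);           /* output S+K          */
-             }
-             o = c;                           /* Step 4              */
-         }
-     }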
-
- It may seem psychotic at first, but it works. Just keep reading it and
- thinking about it. The best way is with a pencil and pad experimenting
- with a character set of four characters (A, B, C, and D) and a "word"
- like "ABACADABA" and following through the steps.
-
- (An actual walk-through is not included here, since it will take up
- *way* too much space. I recommend getting further, more detailed
- descriptions of LZW if you plan on writing an implementation.)
-
- One important thing is *not* to confuse the `code' for a string
- (its index in the hash table) with the string itself.
-
- Two points about LZW implementations:
-
- 1. The code stream (i.e. the compressed output) is not usually
- a set number of bits, but reflects how many entries are in
- the string table. This is another way to further compress
- the data.
-
- Usually, LZW schemes will output each code in the number of bits
- that exactly reflects the size of the string table. Thus for the
- first 512 codes the output is 9 bits. Once the 513th string is added
- to the table, it's 10 bits, and so on.
-
- Some implementations will use a constant output width until a
- threshold is reached. For example, the code output will always be
- 12 bits until the 4097th string is added, then the code output is in
- 16-bit segments, etc...
-
- If these scenarios are used, the decompression routine *must*
- "think ahead" so as to read in the proper number of bits from
- the encoded data. (A sketch of such a code writer appears after
- point 2, below.)
-
- 2. Because full string tables are incredible memory hogs, many
- flavors of LZW will use a "hash" table instead (so a string
- table and a hash table aren't quite the same thing).
-
- Instead of a string of characters like "ABC", there'll be
- a table containing the last character of the string and a
- reference to the previous character(s). Thus we'll have
- "A", "[A]B" and "[[A]B]C", where [x] is the reference (or the
- actual "code") for that character/string. Using this method
- may be slower, but it saves memory and makes the implementation
- a little easier.
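-
- Here's a sketch in C of the width-growing code writer from point 1
- (put_code() and its globals are my own toy helpers, not from any
- particular archiver):
-
-     #include <stdio.h>
-
-     static unsigned long bitbuf = 0;    /* bits not yet written out */
-     static int nbits = 0;
-
-     /* Write 'code' using just enough bits for the current table
-        size: 9 bits while the table holds up to 512 entries, 10
-        bits up to 1024, and so on.  The decoder must grow its read
-        width at exactly the same points to stay in step.           */
-     void put_code(FILE *out, int code, int table_size)
-     {
-         int width = 9;
-
-         while ((1 << width) < table_size)
-             width++;
-         bitbuf = (bitbuf << width) | (unsigned long)code;
-         nbits += width;
-         while (nbits >= 8) {            /* flush whole bytes        */
-             nbits -= 8;
-             putc((int)((bitbuf >> nbits) & 0xFF), out);
-         }
-     }
-
- (At the end of the stream any leftover bits in bitbuf still have to
- be flushed, padded out to a whole byte.)
-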
- There are many variations on LZW using other sets of strings to
- compare with the string table--this is based on the assumption that
- the more entries there are in the table, the more efficient the
- compression will be.
-
- One variation (Lempel-Ziv-Yokoo) would add ButFirst(S+K) every
- time S+K was added. "ButFirst" means all characters but the first
- one, so ButFirst("ABCD") = "BCD".
-
-
- Anyhow, those are the compression algorithms I know about and can
- even remotely attempt to explain.
- I apologize for any (glaring?) mistakes. It's late...what started as a
- meaningless reply to somebody asking something about SQUEEZE exploded
- into this post. I hope it's useful to anybody interested.
- }